Abstract

We present a theoretical framework for the compression of automata, which are widely used in speech 
processing and other natural language processing tasks. The framework extends to graph compression. 
Similar to stationary ergodic processes, we formulate a probabilistic process of graph and automata 
generation that captures real world phenomena and provide a universal compression scheme LZA for this 
probabilistic model. Further, we show that LZA significantly outperforms other compression techniques 
such as gzip and the UNIX compress command for several synthetic and real data sets. 


1 Introduction 


The rapid generation of data by search engines and popular online sites, which has been reported to be on the
order of hundreds of petabytes, requires efficient storage mechanisms and better compression algorithms.
Similarly, sophisticated models on mobile devices for tasks such as speech-to-text conversion often need
large storage space, and memory constraints on these devices demand better compression algorithms.
Furthermore, downloading these models requires high bandwidth, so transmitting a compressed version saves
communication costs. Hence there is a need for efficient data compression both at the data warehouse level
(petabytes of data) and at the device level (megabytes of data).

Most current compression techniques have been developed for sequential data. For example,
Huffman coding and arithmetic coding are optimal compression schemes when the symbols of the underlying
sequence are independent and identically distributed (i.i.d.) according to some known distribution [10].
If the sequence is not generated according to an i.i.d. process, but generated from a stationary ergodic
process, then Lempel-Ziv schemes are asymptotically optimal [?]. In practice, a combination of these
schemes is often used. For example, the UNIX compress command implements Lempel-Ziv-Welch (LZW) and
gzip combines Lempel-Ziv-77 (LZ77) and Huffman coding.

However, data is often structured. For example, in a webgraph, a node represents a URL and a directed 
edge between two nodes indicates that one URL has a link to another. In social networks, a node represents 
a user and an edge between two nodes indicates that they are friends. Finite automata and transducers
are widely used in speech recognition and a variety of other language processing tasks such as machine
translation, information extraction, and parsing [18]. For example, in speech-processing automata, a path
may correspond to a possible sentence in a language model or in a set of recognizer hypotheses (a so-called 
lattice). Often these data sets are very large. For web graphs, there are tens of billions of web pages to 
choose from. For speech processing, a large-alphabet language model may have billions of word edges. 
Hence structured data compression is useful in practice. 

A natural question is how one can exploit the structure in data to develop better compression
algorithms. Can one do better than serializing the data and applying algorithms for sequence compression?
Surprisingly, these questions and the compression of structured data have received little attention. Motivated
by the previous examples, we focus on automata compression and, as a corollary, graph compression.

Webgraph compression was studied empirically in [5, 6, 15]. Theoretical webgraph compression was first
studied by [?], who proposed a scheme that uses a minimum spanning tree to find similar nodes to compress.
However, they showed that many generalizations of their problem are NP-hard. Motivated by probabilistic
models, [8] showed that arithmetic coding can be used to near-optimally compress (the structure of)
graphs generated by the Erdős-Rényi model.

Automata compression has been studied empirically by [11]. However, we are not aware of any
theoretical work focused on automata compression. Our goal is three-fold: (i) propose a probabilistic
model for automata that captures real-world phenomena, (ii) provide a provably universal compression
algorithm, and (iii) show experimentally that the algorithm fares well compared to techniques such as gzip
and compress. We note that our probabilistic model can be viewed as a generalization of Erdős-Rényi
graphs [?].

The rest of the paper is organized as follows: in Section 2, we describe automata and their properties. In
Section 3, we describe our probabilistic model and show how it captures many real-world applications. In
Section 4, we describe our proposed algorithm LZA and prove its optimality, and in Section 5 we demonstrate
the algorithm's practicality in terms of its degree of compression.


2 Directed Graphs and Finite Automata 

A directed graph is a pair $(Q, \delta)$ where $Q = \{1, 2, 3, \ldots, n\}$ is the set of nodes and
$\delta \colon Q \to Q^*$ is the set of edges, where for every node $q$, $\delta(q)$ is the set of nodes to
which it is connected. Note that our notation for directed graphs is chosen to harmonize with finite automata.

Automata generalize graphs. An unweighted automaton $A$ is a 5-tuple $(Q, \Sigma, \delta, q_I, F)$ where
$Q = \{1, 2, \ldots, n\}$ is the set of states, $\Sigma = \{1, 2, \ldots, m\}$ is a finite alphabet,
$\delta \colon Q \times \Sigma \to Q^*$ is the transition function, $q_I \in Q$ is the initial state, and
$F \subseteq Q$ are the final states. The transitions from state $q$ by label $a$ to states
$\{q'_1, q'_2, \ldots\}$ are given by $\delta(q, a) = \{q'_1, q'_2, \ldots\}$. If there is no transition by
label $a$, then $\delta(q, a) = \emptyset$.

We use $E \subseteq Q \times \Sigma \times Q$ to denote the set of all transitions $(q, a, q')$ and $E[q]$ to
denote the set of all transitions from state $q$.
Figure 3: An example weighted transducer.

Figure 4: An example of isomorphic automata. The two automata above are the same under the permutation
$0 \to 0$, $1 \to 2$, and $2 \to 1$.


An example of an automaton is given in Figure 1. State 0 in this simple example is the initial state
(depicted with the bold circle) and state 1 is the final state (depicted with the double circle). The strings
12 and 222 are among those accepted by this automaton. By using symbolic labels on this automaton in place of
the usual integers, as depicted in Figure 2, we can interpret this automaton as the operation of a subway
turnstile. It has two states, locked and unlocked, and two actions (the alphabet), coin and push. If the
turnstile is in the locked state and you push, it remains locked, and if you insert a coin, it becomes
unlocked. If it is unlocked and you insert a coin, it remains unlocked, but if you push once, it becomes locked.
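The turnstile example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the figures are not reproduced here, so the mapping of label 1 to push, label 2 to coin, state 0 to locked and state 1 to unlocked is inferred from the accepted strings 12 and 222, and the names `DELTA` and `accepts` are hypothetical.

```python
# Hypothetical encoding of the turnstile automaton described above.
# Assumption: label 1 = "push", label 2 = "coin"; state 0 = locked (initial),
# state 1 = unlocked (final). The mapping is inferred from the text, not the figure.
DELTA = {
    (0, 1): {0},  # locked,   push -> locked
    (0, 2): {1},  # locked,   coin -> unlocked
    (1, 1): {0},  # unlocked, push -> locked
    (1, 2): {1},  # unlocked, coin -> unlocked
}
INITIAL = 0
FINALS = {1}


def accepts(delta, initial, finals, labels):
    """Return True if some path labeled `labels` leads from `initial` to a final state."""
    current = {initial}
    for a in labels:
        nxt = set()
        for q in current:
            # A missing key plays the role of delta(q, a) being the empty set.
            nxt |= delta.get((q, a), set())
        current = nxt
    return bool(current & finals)
```

With this encoding, `accepts(DELTA, INITIAL, FINALS, [1, 2])` and `accepts(DELTA, INITIAL, FINALS, [2, 2, 2])` both hold, matching the two strings mentioned in the text.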

Note that directed graphs form a subset of automata with $\Sigma = \{1\}$ and hence we focus on automata
compression. Furthermore, to be consistent with the existing automata literature, we use states to refer to
nodes and transitions to refer to edges in both graphs and automata going forward.

A main motivation to study automata is their application in speech and natural language processing. In 
some circumstances, transitions may be generalized to have an output label and a weight as well as the usual 
input label. Such automata, called weighted finite state transducers (FSTs), are extensively used in these 
fields [2, 18, 19]. An example of an FST is given in Figure 3. The string 12 is among those accepted by this
transducer. For this input, the transducer outputs the string 23 and has weight 0.046875 (transition weights
0.75 times 0.25 times final weight 0.25).

We propose an algorithm for unweighted automata compression. For FSTs, we use the same algorithm 
by treating the input-output label pair as a single label. If the automaton is weighted, we just add the weights 
at the end of the compressed file by using some standard representation. 
3 Random automata compression


3.1 Probabilistic model 

Our goal is to propose a probabilistic model for automata generation that captures real world phenomena. 
To this end, we first review probabilistic models on sequences and draw connections to probabilistic models 
for automata. 

3.1.1 Probabilistic processes on sequences 

We now define i.i.d. sampling of sequences. Let $x^n$ denote an $n$-length sequence $x_1, x_2, \ldots, x_n$.
If $x^n$ are $n$ independent samples from a distribution $p$ over $\mathcal{X}$, then
$p(x^n) = \prod_{i=1}^{n} p(x_i)$. Note that under i.i.d. sampling, the index of the sample has no importance, i.e.,

$$p(X_i = x) = p(X_j = x), \quad \forall\, 1 \le i, j \le n,\ x \in \mathcal{X}.$$

Stationary ergodic processes generalize i.i.d. sampling. For a stationary ergodic process $p$ over sequences,

$$p(X_{i+1}^{i+m} = x^m) = p(X_{j+1}^{j+m} = x^m), \quad \forall\, i, j, m, x^m,$$

where $X_a^b$ denotes $X_a, X_{a+1}, \ldots, X_b$. Informally, stationary ergodic processes are those for which
only the relative position of the indices matters and not the actual ones.

3.1.2 Probabilistic processes on automata 

Before deriving models for automata generation, we first discuss an invariance property of automata that 
is useful in practice. The set of strings accepted by an automaton and the time and space of its use are 
not affected by the state numbering. Two automata are isomorphic if they coincide modulo a renumbering 
of the states. Thus, automata $(Q, \Sigma, \delta, q_I, F)$ and $(Q', \Sigma, \delta', q'_I, F')$ are
isomorphic if there is a one-to-one mapping $f \colon Q \to Q'$ such that
$f(\delta(q, a)) = \delta'(f(q), a)$ for all $q \in Q$ and $a \in \Sigma$, $f(q_I) = q'_I$, and
$f(F) = F'$, where $f(F) = \{f(q) : q \in F\}$.

Under stationary ergodic processes, two sequences with the same order of observed symbols have the
same probabilities. Similarly, we wish to construct a probabilistic model of automata such that any two
isomorphic automata have the same probabilities, since the state numbering does not have explicit importance.
For example, the probabilities of the automata in Figure 4 are the same.
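The isomorphism condition can be checked mechanically for a given renumbering. The sketch below is illustrative only: the automaton `A` is a hypothetical three-state example (not the one in Figure 4), automata are represented as `(delta, initial, finals)` triples with absent keys standing for empty transition sets, and only the permutation $0 \to 0$, $1 \to 2$, $2 \to 1$ is taken from the caption of Figure 4.

```python
def relabel(delta, initial, finals, f):
    """Apply the state renumbering f (a dict old -> new) to an automaton."""
    new_delta = {
        (f[q], a): {f[r] for r in dests}
        for (q, a), dests in delta.items()
        if dests  # absent keys stand for empty transition sets
    }
    return new_delta, f[initial], {f[q] for q in finals}


def is_isomorphism(aut1, aut2, f):
    """Check that f is one-to-one and maps aut1 = (delta, initial, finals) onto aut2."""
    if len(set(f.values())) != len(f):
        return False  # f is not one-to-one
    return relabel(*aut1, f) == aut2


# Hypothetical automaton and its image under the permutation 0->0, 1->2, 2->1.
A = ({(0, 1): {1}, (1, 2): {2}, (2, 1): {2}}, 0, {2})
F = {0: 0, 1: 2, 2: 1}
B = relabel(*A, F)
```

Here `is_isomorphism(A, B, F)` holds by construction, while the identity map is not an isomorphism between `A` and `B`, since the two numberings differ.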

There are several probabilistic models of automata and graphs that satisfy this property. Perhaps the most 
studied random model is the Erdős-Rényi model $G(n, p)$, where each state is connected to every other state
independently with probability $p$ [?]. Note that if two automata are isomorphic then the Erdős-Rényi model
assigns them the same probability. The Erdős-Rényi model is analogous to i.i.d. sampling on sequences.
We wish to generalize the Erdős-Rényi model to more realistic models of automata.

Since the state numbering is to be disregarded, the only possible dependence of transitions from a state
would be on the paths leading to that state. This arises naturally in language modeling tasks. For example,
in an n-gram model, a state might have an outgoing transition with label Francisco or Diego only if it has
an incoming transition with label San. This is an example where we have restrictions on paths of length 2. In
general, we may have restrictions on paths of any length $\ell$.

We define an $\ell$-memory model for automata as follows. Let $h_q$ be the set of paths of length at most
$\ell$ leading to the state $q$. The probability distribution of transitions from a state depends on the
paths leading to it. Let $\delta(q, *) = \delta(q, 1), \delta(q, 2), \ldots, \delta(q, m)$. Then

$$p(A) = p(\delta(1, *), \delta(2, *), \ldots, \delta(n, *)) \propto \prod_{q=1}^{n} p(\delta(q, *) \mid h_q).$$

Similarly, transitions leaving a state $q$ dissociate into marginals conditioned on the history, and the
probability that $q' \in \delta(q, a)$ also dissociates into marginals:

$$p(\delta(q, *) \mid h_q) = \prod_{a \in \Sigma} p(\delta(q, a) \mid h_q)
= \prod_{a \in \Sigma} \prod_{q'=1}^{n} p(\mathbb{1}(q' \in \delta(q, a)) \mid h_q),$$

where $\mathbb{1}(q' \in \delta(q, a))$ is the indicator of the event $q' \in \delta(q, a)$. Note that the
probabilities are defined up to proportionality, since the product above need not sum to one over all
automata. We therefore introduce a constant $Z$ to ensure that it is a probability distribution:

$$p(A) = p(\delta(1, *), \delta(2, *), \ldots, \delta(n, *))
= \frac{1}{Z} \prod_{q=1}^{n} p(\delta(q, *) \mid h_q)
= \frac{1}{Z} \prod_{q=1}^{n} \prod_{a \in \Sigma} p(\delta(q, a) \mid h_q).$$

Note that $\ell$-memory models assign the same probability to automata that are isomorphic. In our
calculations, we restrict $\ell$ to make the model tractable.

Note that sequences form a subset of automata as follows. For a sequence $x^n$ over alphabet $\Sigma$,
consider the automaton representation with states $Q = \{1, 2, \ldots, n\}$, initial state $q_I = 1$, final
state $F = \{n\}$, alphabet $\Sigma$, and transition function $\delta(i, x_i) = i + 1$ and
$\delta(i, x) = \emptyset$ for all $x \ne x_i$. Informally, every sequence can be represented as an
automaton with a line as the underlying structure. Furthermore, note that the property that two isomorphic
automata should have the same probability is the same as stating that the indices in sequences do not have
explicit meaning (the stationary ergodic property).

3.2 Entropy and coding schemes 

A compression scheme is a mapping from $\mathcal{X}$ to $\{0, 1\}^*$ such that the resulting code is
prefix-free and can be uniquely recovered. For a coding scheme $c$, let $l_c(x)$ denote the length of the code
for $x \in \mathcal{X}$. It is well-known that the expected number of bits used by any coding scheme is at
least the entropy of the distribution, defined as $H(p) = \sum_{x \in \mathcal{X}} p(x) \log \frac{1}{p(x)}$.
The well-known Huffman coding scheme achieves this entropy up to one additional bit. For $n$-length sequences,
arithmetic coding is used, which achieves compression up to the entropy with a few additional bits of error.
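As a small illustration of the relation between entropy and Huffman code lengths, the following sketch computes $H(p)$ and the codeword lengths of a Huffman code for a finite distribution. It is a minimal example, not part of the method of this paper; the function names `entropy` and `huffman_lengths` are hypothetical.

```python
import heapq
from math import log2


def entropy(p):
    """Entropy in bits of a finite distribution given as a list of probabilities."""
    return sum(pi * log2(1 / pi) for pi in p if pi > 0)


def huffman_lengths(p):
    """Codeword lengths of a Huffman code for the distribution p (at least two symbols)."""
    # Heap entries: (probability, tie-breaker, symbols in this subtree).
    heap = [(pi, i, [i]) for i, pi in enumerate(p)]
    heapq.heapify(heap)
    lengths = [0] * len(p)
    counter = len(p)
    while len(heap) > 1:
        p1, _, s1 = heapq.heappop(heap)
        p2, _, s2 = heapq.heappop(heap)
        for s in s1 + s2:
            lengths[s] += 1  # every symbol in the merged subtree gets one more bit
        heapq.heappush(heap, (p1 + p2, counter, s1 + s2))
        counter += 1
    return lengths


def expected_length(p, lengths):
    """Expected number of bits per symbol for the given codeword lengths."""
    return sum(pi * li for pi, li in zip(p, lengths))
```

For a dyadic distribution such as `[0.5, 0.25, 0.125, 0.125]` the expected length equals the entropy; for a skewed distribution such as `[0.9, 0.1]` it lies between $H(p)$ and $H(p) + 1$.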

The above-mentioned coding methods such as Huffman coding and arithmetic coding require the knowledge
of the underlying distribution. In many practical scenarios, the underlying distribution may be unknown
and only the broader class to which the distribution belongs may be known. For example, we might
know that the given $n$-length sequence is generated by i.i.d. sampling of some unknown distribution $p$ over
$\{1, 2, \ldots, k\}$. The objective of a universal compression scheme is to asymptotically achieve $H(p)$
bits per symbol even if the distribution is unknown. A coding scheme $c$ for sequences over a class of
distributions $\mathcal{P}$ is called universal if


$$\limsup_{n \to \infty} \max_{p \in \mathcal{P}} \frac{\mathbb{E}[l_c(X^n)] - H(X^n)}{n} = 0.$$


The normalization factor in the above definition is $n$, since the number of sequences of length $n$ grows
as $\exp(n)$, so its logarithm grows linearly with $n$. For automata and graphs with $n$ states (denoted by
$A_n$), we choose a scaling factor of $n^2$, as the number of automata scales as $\exp(n^2)$. We call a coding
scheme $c$ for automata over a class of distributions $\mathcal{P}$ universal if


$$\limsup_{n \to \infty} \max_{p \in \mathcal{P}} \frac{\mathbb{E}[l_c(A_n)] - H(A_n)}{n^2} = 0.$$


We now describe the algorithm LZA. Note that the algorithm does not require the knowledge of the underlying
parameters or the probabilistic model.


4 Algorithm for automaton compression 

Our algorithm recursively finds substructures over states and uses a Lempel-Ziv subroutine. Our coding
method is based on two auxiliary techniques to improve the compression rate: Elias-delta coding and coding
the differences. We briefly discuss these techniques and their properties before describing our algorithm.


4.1 Elias-delta coding and coding the differences 


Elias-delta coding is a universal compression scheme for integers [14]. To represent a positive integer $x$,
Elias-delta codes use $\lfloor \log x \rfloor + 2 \lfloor \log(\lfloor \log x \rfloor + 1) \rfloor + 1$ bits.
To obtain a code over $\mathbb{N} \cup \{0\}$, we replace $x$ by $x + 1$ and use Elias-delta codes.
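For concreteness, a minimal sketch of Elias-delta encoding and decoding is given below. Codes are represented as strings of '0' and '1' purely for readability; the function names are hypothetical and this is not the implementation used in our experiments.

```python
def elias_delta_encode(x):
    """Elias-delta code of a positive integer x, as a string of '0'/'1' characters."""
    if x < 1:
        raise ValueError("Elias-delta codes are defined for positive integers")
    n_bits = x.bit_length()         # floor(log2 x) + 1
    len_bits = n_bits.bit_length()  # floor(log2 n_bits) + 1
    # Elias-gamma code of n_bits, followed by x without its leading 1 bit.
    prefix = "0" * (len_bits - 1) + format(n_bits, "b")
    return prefix + format(x, "b")[1:]


def elias_delta_decode(bits, pos=0):
    """Decode one integer starting at bits[pos]; return (value, next position)."""
    zeros = 0
    while bits[pos + zeros] == "0":
        zeros += 1
    pos += zeros
    n_bits = int(bits[pos:pos + zeros + 1], 2)
    pos += zeros + 1
    value = int("1" + bits[pos:pos + n_bits - 1], 2)
    return value, pos + n_bits - 1
```

For example, `elias_delta_encode(10)` yields `"00100010"`, which has $3 + 2 \cdot 2 + 1 = 8$ bits, in agreement with the length formula above. A non-negative integer $x$ would be coded as `elias_delta_encode(x + 1)`.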

We now use Elias-delta codes to code sets of integers. Let $x_1, x_2, \ldots, x_m$ be integers such that
$0 \le x_1 \le x_2 \le \cdots \le x_m \le n$. We use the following algorithm to code $x_1, x_2, \ldots, x_m$.
The decoding algorithm follows from Elias-Decode [14].


Algorithm Difference-Encode

Input: Integers $0 \le x_1 \le x_2 \le \cdots \le x_m \le n$.

1. Use Elias-Encode to code $x_1 - 0, x_2 - x_1, \ldots, x_m - x_{m-1}$.

Lemma 1 (Appendix A). For integers such that $0 \le x_1 \le x_2 \le \cdots \le x_d \le n$, Difference-Encode
uses at most

$$d \log \frac{n + d}{d} + 2 d \log\left(\log \frac{n + d}{d} + 1\right) + d$$

bits.

We first give an example to illustrate Difference-Encode's usefulness. Consider graph representation
using adjacency lists. For every source state, the order in which the destination states are stored does not
matter. For example, if state 1 is connected to states 2, 4, and 3, it suffices to represent the unordered set
$\{2, 3, 4\}$. In general, if a state is connected to $d$ out of $n$ states, then it suffices to encode the
ordered set of states $y_1, y_2, \ldots, y_d$ where $1 \le y_1 \le y_2 \le \cdots \le y_d \le n$. The number of
such possible sets is $\binom{n}{d}$. If the state-sets are all equally likely, then the entropy of state-sets
is $\log \binom{n}{d} \approx d \log \frac{n}{d}$.

If each state is represented using $\log n$ bits, then $d \log n > d \log \frac{n}{d}$ bits are used, which
is not optimal. However, by Lemma 1, Difference-Encode uses
$d \log \frac{n + d}{d} (1 + o(1)) \approx d \log \frac{n}{d}$ bits and hence is asymptotically optimal.
Furthermore, the bounds in Lemma 1 are for the worst-case scenario and in practice Difference-Encode yields
much higher savings. A similar scenario arises in LZA, as discussed later.
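The comparison above can be made concrete by counting bits. The sketch below computes the number of bits Difference-Encode would use for a sorted list of integers, assuming each gap is shifted by one before Elias-delta coding so that zero gaps are allowed, and compares it with the bound of Lemma 1 and with a fixed-width code. The function names and the example set are hypothetical.

```python
from math import ceil, log2


def elias_delta_length(x):
    """Number of bits of the Elias-delta code of a positive integer x."""
    n_bits = x.bit_length()  # floor(log2 x) + 1
    return (n_bits - 1) + 2 * (n_bits.bit_length() - 1) + 1


def difference_encode_length(xs):
    """Bits used when the gaps x_i - x_{i-1} (with x_0 = 0) are Elias-delta coded.

    Each gap is shifted by one so that a gap of zero can be represented.
    """
    total, prev = 0, 0
    for x in sorted(xs):
        total += elias_delta_length(x - prev + 1)
        prev = x
    return total


def lemma1_bound(n, d):
    """Upper bound of Lemma 1 for d integers in the range 0..n."""
    r = (n + d) / d
    return d * log2(r) + 2 * d * log2(log2(r) + 1) + d


# Hypothetical example: a state connected to every 10th state out of n = 1000.
n = 1000
destinations = list(range(10, n + 1, 10))
d = len(destinations)
diff_bits = difference_encode_length(destinations)  # 100 gaps of 10, 8 bits each
fixed_bits = d * ceil(log2(n))                      # 10 bits per destination
```

In this example `diff_bits` is 800, below both the fixed-width cost of 1000 bits and the Lemma 1 bound of roughly 877 bits.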

4.2 LZA 

We now have at our disposal the tools needed to design a variant of the Lempel-Ziv algorithm for compressing
automata, which we denote by LZA. Let $d_q \stackrel{\mathrm{def}}{=} |E[q]|$ be the number of transitions
from state $q$ and let the transitions in
$E[q] = \{(q, a_1, q_1), (q, a_2, q_2), \ldots, (q, a_{d_q}, q_{d_q})\}$ be ordered as follows: for all $i$,
$q_i \le q_{i+1}$, and if $q_i = q_{i+1}$ then $a_i < a_{i+1}$.

The algorithm is based on the observation that the ordering of the transitions leaving a state does not
affect the definition of an automaton and works as follows. The states of the automaton are visited in
BFS order. For each state visited, the set of outgoing transitions is sorted based on their destination state.
Next, the algorithm recursively finds the largest overlap of the sets of transitions that match some dictionary
element and encodes the pair (matched dictionary element number, next transition). It adds the dictionary
element to $T_d$, the label of the transition to $T_\Sigma$, and the destination state of the transition to
$T_\delta$. It also updates the dictionary by adding a new dictionary element (matched dictionary element
number, next transition) to the dictionary. Finally, it encodes $T_d$ and $T_\delta$ using Difference-Encode
and encodes each element in $T_\Sigma$ using $\lceil \log m \rceil$ bits.

Algorithm LZA

Input: The transition label function $\delta$ of the automaton.

Output: Encoded sequence $S$.

1. Set dictionary $D = \emptyset$.

2. Visit all states $q$ in BFS order. For every state $q$ do:

   (a) Code $d_q$ using $\lceil \log nm \rceil$ bits.

   (b) Set $T_d = \emptyset$, $T_\Sigma = \emptyset$, and $T_\delta = \emptyset$.

   (c) Start with $j = 1$ in $E[q] = \{(q, a_1, q_1), (q, a_2, q_2), \ldots\}$ and continue till $j$ reaches $d_q$.

       i. Find the largest $l$ such that $(a_j, q_j), \ldots, (a_{j+l}, q_{j+l}) \in D$. Let this dictionary element be $d_r$.

       ii. Add $(a_j, q_j), \ldots, (a_{j+l+1}, q_{j+l+1})$ to $D$.

       iii. Add $d_r$ to $T_d$, $q_{j+l+1}$ to $T_\delta$, and $a_{j+l+1}$ to $T_\Sigma$.

   (d) Use Difference-Encode to encode $T_d$ and $T_\delta$, and encode each element in $T_\Sigma$ using
       $\lceil \log m \rceil$ bits. Append these sequences to $S$.

3. Discard the dictionary and output $S$.
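The dictionary-building part of the algorithm can be illustrated with the following sketch. It is a simplified illustration under several assumptions, not the implementation described in this paper: it performs only the LZ78-style parsing of each state's sorted transition list with a dictionary shared across states, keeps each phrase as an ordered (matched index, next transition) pair, and omits the bit-level coding of $d_q$, $T_d$, $T_\delta$ and $T_\Sigma$. The names `lza_parse` and `lza_unparse`, the use of index 0 for the empty dictionary element, and the `None` marker for a list that ends inside a known phrase are all hypothetical choices.

```python
def lza_parse(transitions_by_state, order):
    """LZ78-style parse of each state's sorted transitions with a shared dictionary.

    transitions_by_state: dict mapping a state to an iterable of (label, dest) pairs.
    order: list of states in visiting order (for example, BFS order).
    Returns a list of (state, d_q, phrases); each phrase is a pair
    (matched dictionary index, next (label, dest) pair or None).
    """
    dictionary = {(): 0}  # index 0 is the empty dictionary element
    output = []
    for q in order:
        # Sort by destination state, then by label, as in the ordering of E[q].
        e = sorted(transitions_by_state.get(q, ()), key=lambda t: (t[1], t[0]))
        phrases = []
        j = 0
        while j < len(e):
            # The dictionary is prefix-closed, so extending one step at a time
            # finds the longest match.
            l = 0
            while j + l < len(e) and tuple(e[j:j + l + 1]) in dictionary:
                l += 1
            matched = dictionary[tuple(e[j:j + l])]
            if j + l < len(e):
                nxt = e[j + l]
                dictionary[tuple(e[j:j + l + 1])] = len(dictionary)
                j += l + 1
            else:
                nxt = None  # the list ended inside a known phrase
                j += l
            phrases.append((matched, nxt))
        output.append((q, len(e), phrases))
    return output


def lza_unparse(parsed):
    """Rebuild the sorted transition lists from the output of lza_parse."""
    phrases_by_index = [()]
    result = {}
    for q, d_q, phrases in parsed:
        e = []
        for matched, nxt in phrases:
            phrase = phrases_by_index[matched]
            if nxt is not None:
                phrase = phrase + (nxt,)
                phrases_by_index.append(phrase)
            e.extend(phrase)
        assert len(e) == d_q
        result[q] = e
    return result
```

In this representation, the matched indices of a state's phrases play the role of $T_d$, and the destinations and labels of the next transitions play the roles of $T_\delta$ and $T_\Sigma$. When two states share a prefix of their transition lists, the second state reuses the dictionary elements created for the first and therefore needs fewer phrases.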

We note that simply compressing the unordered sets $T_d$ and $T_\delta$ suffices for unique reconstruction and
thus Difference-Encode is the natural choice. Observe that Difference-Encode is a succinct representation
of the dictionary and does not affect the way the Lempel-Ziv dictionary is built. Thus the decoding
algorithm follows immediately from retracing the steps in LZA and the LZ78 decoding algorithm.

If Difference-Encode is not used, the number of bits used would be approximately
$|D| \log |D| + |D| \log n$, which is strictly greater than the number of bits in Lemma 2. Furthermore, we did
consider several other natural variants of this algorithm where we difference-encode the states first and then
serialize the data using a standard Lempel-Ziv algorithm. However, we could not prove asymptotic optimality for
those variants. Proving their non-optimality requires constructing distributions for which the algorithm is
non-optimal and is not the focus of this paper.

We first bound the number of bits used by LZA in terms of the size of the dictionary $|D|$, the number of
states $n$, and the alphabet size $m$. This bound is independent of the underlying probabilistic model. Next,
we proceed to derive probabilistic bounds.

Lemma 2 (Appendix B). The total number of bits used by LZA is at most

$$|D| \left[\log(n + 1) + \log(\nu + 1) + 2 \log(\log(n + 1) + 1)\right]
+ |D| \left[2 \log(\log(\nu + 1) + 1) + 2 + \lceil \log m \rceil\right] + n \lceil \log nm \rceil,$$

where $\nu = \frac{n^2}{|D|}$.


4.3 Proof of optimality 

In this section, we prove that LZA is asymptotically optimal for the random automata model introduced in
Section 3. Lemma 2 gives an upper bound on the number of bits used in terms of the size of the dictionary
$|D|$. We now present a lower bound on the entropy in terms of $|D|$ which will help us prove this result. The
proof is given in Appendix C.

Lemma 3. LZA satisfies

$$H(p) \ge \mathbb{E}[|D|] \left(\log \frac{\mathbb{E}[|D|]}{n} - m^\ell
- \log\left(\frac{n^2 m}{\mathbb{E}[|D|]} + 1\right) - 1\right).$$


The above result together with Lemma 2 implies

Theorem 4 (Appendix D). If $2^{m^\ell} = o\left(\frac{\log n}{\log \log n}\right)$, LZA is a universal
compression algorithm.


5 Experiments 

5.1 Automaton structure compression 

LZA compresses automata, but for most applications, it is sufficient to compress the automata structure. We
convert LZA into $\mathrm{LZA}_S$, an algorithm for automata structure compression, as follows. We first
perform a breadth first search (BFS) with the initial state as the root state and relabel the states in their
BFS visitation order. We then run LZA with the following modification. In step 2, for every state $q$ we divide
the transitions from $q$ into two groups: $T_{\mathrm{old}}$, transitions whose destination states have been
traversed before in LZA, and $T_{\mathrm{new}}$, transitions whose destinations have not been traversed. Note
that since the state numbers are ordered based on a BFS visit, the destination state numbers in
$T_{\mathrm{new}}$ are $1, 2, \ldots, n$, and can be recovered easily while decoding and thus need not be
stored. Hence, we run step 2b in LZA only on transitions in $T_{\mathrm{old}}$. For $T_{\mathrm{new}}$,
we just compress the transition labels using LZ78.
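The BFS relabeling step can be sketched as follows. This is an illustration under assumptions that the text does not fix: states are renumbered from 0 rather than 1, the outgoing transitions of a state are explored in order of (label, old destination), and states unreachable from the initial state are dropped. The function name `bfs_relabel` is hypothetical.

```python
from collections import deque


def bfs_relabel(delta, initial):
    """Relabel states by BFS visitation order starting from `initial`.

    delta: dict mapping (state, label) to a set of destination states.
    Returns (mapping old -> new, relabeled delta).
    Assumption: ties are broken by (label, old destination number).
    """
    outgoing = {}
    for (q, a), dests in delta.items():
        for r in dests:
            outgoing.setdefault(q, []).append((a, r))

    mapping = {initial: 0}
    queue = deque([initial])
    while queue:
        q = queue.popleft()
        for a, r in sorted(outgoing.get(q, [])):
            if r not in mapping:
                mapping[r] = len(mapping)  # next unused BFS number
                queue.append(r)

    new_delta = {
        (mapping[q], a): {mapping[r] for r in dests}
        for (q, a), dests in delta.items()
        if q in mapping  # unreachable states are dropped in this sketch
    }
    return mapping, new_delta
```

After relabeling, the first transition that reaches each state has a destination number equal to the next unused number, which is why such destinations can be recovered during decoding without being stored.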

Since each destination state can appear in $T_{\mathrm{new}}$ only once, the number of transitions in
$\bigcup_q T_{\mathrm{new}}$ is at most $n$. Since this number is $\ll n^2$, the normalization factor in the
definition of a universal compression algorithm for automata, the proof of Theorem 4 extends to
$\mathrm{LZA}_S$. Since for most applications it is sufficient to compress the automata structure, we
implemented $\mathrm{LZA}_S$ in C++ and added it to the OpenFst open-source library [?].









Class    LZA_S    compress    gzip     LZA+gzip
G_1      18260    22681       23752    17320
A_1      21745    33478       31682    21108
G_2       2536     4994        4564     2443
A_2       3027     6707        5546     2940

Table 1: Synthetic data compression examples (in bytes).


5.2 Comparison 

The best known convergence rates of all Lempel-Ziv algorithms for sequences are
$O\left(\frac{\log \log n}{\log n}\right)$. LZA has the same convergence rate under the $\ell$-memory
probabilistic model.

However, in practice data sets have finitely many states and the underlying automata may not be generated
from an $\ell$-memory probabilistic model. To demonstrate the practicality of the algorithm, we compare
$\mathrm{LZA}_S$ with the Unix compress command (Lempel-Ziv-Welch, a variant of LZ78) and gzip (LZ77 and
Huffman coding) for various synthetic and real data sets.

5.2.1 Synthetic Data 

While the $\ell$-memory probabilistic model illustrates a broad class of probabilistic models on which LZA is
universal, generating samples from an $\ell$-memory model is difficult as the normalization factor $Z$ is hard
to compute. We therefore test our algorithm on a few simpler synthetic data sets. In all our experiments the
number of states is 1000 and the results are averaged over 1000 runs.

Table 1 summarizes our results for a few synthetic data sets, specified in bytes. Note that one of the
main advantages of $\mathrm{LZA}_S$ over existing algorithms is that $\mathrm{LZA}_S$ just compresses the
structure, which is sufficient for applications in speech processing and language modeling. Furthermore, note
that to obtain the actual automaton from the structure we need the original state numbering, which can be
specified in $n \log n$ bits, which is less than 1250 bytes in our experiments. Even if we add 1250 bytes to
our results in Table 1, LZA still performs better than gzip and compress.
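The arithmetic behind the state-numbering remark can be checked directly. The short sketch below uses the sizes reported in Table 1 and $n = 1000$; the variable and function names are hypothetical.

```python
from math import log2

N_STATES = 1000

# Sizes in bytes from Table 1: class -> (LZA_S, compress, gzip).
TABLE_1 = {
    "G1": (18260, 22681, 23752),
    "A1": (21745, 33478, 31682),
    "G2": (2536, 4994, 4564),
    "A2": (3027, 6707, 5546),
}

# n log n bits for the original state numbering, converted to bytes.
numbering_bytes = N_STATES * log2(N_STATES) / 8


def still_better_with_numbering(extra_bytes=1250):
    """True if LZA_S plus the numbering overhead beats both compress and gzip on every class."""
    return all(lza + extra_bytes < min(c, g) for lza, c, g in TABLE_1.values())
```

Here `numbering_bytes` evaluates to about 1246 bytes, and `still_better_with_numbering()` holds for all four data sets.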

We run the algorithm on four different synthetic data sets $G_1$, $G_2$, $A_1$, $A_2$. $G_1$ and $A_1$ are
models with a uniform out-degree distribution over the states and $G_2$ and $A_2$ are models with a
non-uniform out-degree distribution:

$G_1$: directed Erdős-Rényi graphs where we randomly generate transitions between every source-destination
pair with probability 1/100.

$A_1$: automata version of Erdős-Rényi graphs, where there is a transition between every two states with
probability 1/100 and the transition labels are chosen independently from an alphabet of size 10 for each
transition.

$G_2$: We first assign each state a class $c \in \{1, 2, \ldots, 1000\}$ randomly. We connect every two states
$s$ and $d$ with probability $1/(c_s + c_d)$. This ensures that the graph has degrees varying from 2 to
$\log 1000$.

$A_2$: we generate the transitions as above and we label each transition to be a deterministic function of
the destination state. This is similar to n-gram models, where the destination state determines the label.
Here again we chose $|\Sigma| = 10$.

Figure 5: Real-world compression examples.

Note that LZA always performs better than the standard Lempel-Ziv-based algorithms gzip and compress.
Note that algorithms designed with specific knowledge of the underlying model can achieve better
performance. For example, for $G_1$, arithmetic coding can be used to obtain a compressed file size of
$n^2 h(0.01)/8 \approx 10000$ bytes. However, the same algorithm would not perform well for $G_2$ or $A_2$.
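The figure of roughly 10000 bytes can be reproduced from the binary entropy function $h(p) = -p \log p - (1 - p) \log(1 - p)$, assuming $n = 1000$ states and one independent bit per ordered pair of states. The function names below are hypothetical.

```python
from math import log2


def binary_entropy(p):
    """Binary entropy h(p) in bits."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * log2(p) - (1 - p) * log2(1 - p)


def er_graph_size_bytes(n, p):
    """Entropy in bytes of n^2 independent edge indicators, each present with probability p."""
    return n * n * binary_entropy(p) / 8
```

For `n = 1000` and `p = 0.01` this gives about 10100 bytes, consistent with the approximation in the text and noticeably below the 18260 bytes reported for $G_1$ in Table 1.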

5.2.2 Real-World Data 


We also tested our compression algorithm on a variety of 'real-world' automata drawn from various speech
and natural language applications. These include large speech recognition language models and decoder
graphs, text normalization grammars for text-to-speech [20], speech recognition and machine translation
lattices [17], and pair n-gram grapheme-to-phoneme models [7]. We selected approximately eighty
such automata from these tasks and removed their weights and output labels (if any), since we focus here
on unweighted automata. Figure 5 shows the compressed sizes of these automata, ordered by their uncompressed
(adjacency-list) size rank, with the same set of compression algorithms presented in the synthetic
case. At the smallest sizes, gzip outperforms LZA, but after about 100 kbytes in compressed size, LZA is
better. Overall, the combination of LZA and gzip performs best.
A Proof of Lemma 1


Since 0 is included in the set, the number of bits used to represent $x$ is upper bounded by
$\theta(x) = \log(x + 1) + 2 \log(\log(x + 1) + 1) + 1$. Observe that $\theta$ is a concave function since
both $\log$ and $x \mapsto \log(\log x)$ are concave. Let $x_0 = 0$. Then, by the concavity of $\theta$, the
total number of bits $B$ used can be bounded as follows:

$$B \le \sum_{i=1}^{d} \theta(x_i - x_{i-1})
\le d\, \theta\left(\frac{1}{d} \sum_{i=1}^{d} (x_i - x_{i-1})\right)
= d\, \theta\left(\frac{x_d}{d}\right)
\le d\, \theta\left(\frac{n}{d}\right),$$

where we used for the last inequality $x_d \le n$ and the fact that $\theta$ is an increasing function. This
completes the proof of the lemma.


B Proof of Lemma 2


Let $k_q$ be the number of elements added to the dictionary when state $q$ is visited by LZA. The maximum
value of the destination state is $n$. Thus, by Lemma 1, the number of bits used to code $T_\delta$ is at most

$$\sum_{q=1}^{n} k_q\, \theta\left(\frac{n}{k_q}\right),$$

where $\theta(\cdot)$ is the function introduced in the proof of Lemma 1. Similarly, since the maximum value
of any dictionary element is $|D|$, by Lemma 1, the number of bits used to code $T_d$ is at most

$$\sum_{q=1}^{n} k_q\, \theta\left(\frac{|D|}{k_q}\right).$$

By concavity, these summations are maximized when $k_q = \frac{|D|}{n}$ for all $q$. Plugging that expression
into the sums above yields the following upper bound on the maximum number of bits used:

$$|D|\, \theta(\nu) + |D| \lceil \log m \rceil + |D|\, \theta(n), \quad \text{where } \nu = \frac{n^2}{|D|}.$$

Additionally, this number must be augmented by $n \lceil \log nm \rceil$ since $\lceil \log nm \rceil$ bits
are used to encode each $d_q$, which completes the proof.


C Proof of Lemma 3

One of the main technical tools we use is Ziv’s inequality, which is stated below. 
Lemma 5 (Variation of Ziv's inequality). For a probability distribution $p$ over non-negative integers with
mean $\mu$,

$$H(p) \le \log(\mu + 1) + 1.$$

The next lemma bounds the probability of disjoint events under different distributions. 

Lemma 6. Let $A_1, A_2, \ldots, A_k$ be a set of disjoint events. Then for a set of distributions
$p_1, p_2, \ldots, p_r$,

$$\sum_{i=1}^{k} \sum_{j=1}^{r} p_j(A_i) = \sum_{j=1}^{r} p_j\Big(\bigcup_{i=1}^{k} A_i\Big) \le r.$$

We now lower bound $H(p)$ in terms of the number of dictionary elements.

Let $d_q \stackrel{\mathrm{def}}{=} |E[q]|$ be the number of transitions from state $q$ and let the
transitions in $E[q] = \{(q, a_1, q_1), (q, a_2, q_2), \ldots, (q, a_{d_q}, q_{d_q})\}$ be ordered as follows:
for all $i$, $q_i \le q_{i+1}$, and if $q_i = q_{i+1}$ then $a_i < a_{i+1}$. To simplify the discussion, we
will use the shorthand $e_{q,i} \stackrel{\mathrm{def}}{=} (q, a_i, q_i)$. Then, by the definition of our
probabilistic model,

$$\log p(A) = \sum_{q=1}^{n} \log p(e_{q,1}, e_{q,2}, \ldots, e_{q,d_q} \mid h_q) - \log Z.$$

We group the transitions the way LZA constructed the dictionary. Let $\{D_{q,i}\}$ be the set of dictionary
elements added when state $q$ is visited during the execution of the algorithm. For a dictionary element
$D_{q,i}$, let $s_{q,i}$ be the index of its starting transition $e_{q,j}$ and $t_{q,i}$ the index of its
terminal transition $e_{q,j}$. Then, by the independence of the transition labels and the fact that $Z \ge 1$,

$$\log p(A) \le \sum_{q=1}^{n} \sum_{D_{q,i}}
\log p(e_{q,s_{q,i}}, e_{q,s_{q,i}+1}, \ldots, e_{q,t_{q,i}} \mid h_q).$$

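Let $g_{q,i} = t_{q,i} - s_{q,i}$. We group them now with $g_{q,i}$ and $s_{q,i}$. Let $\mathcal{D}(s, g)$ be
the set of dictionary elements $D_{q,i}$ with $s_{q,i} = s$ and $g_{q,i} = g$ and let $c_{s,g}$ be the
cardinality of that set: $c_{s,g} = |\mathcal{D}(s, g)|$. Then, by Jensen's inequality, we can write

$$\begin{aligned}
&\sum_{q=1}^{n} \sum_{D_{q,i}} \log p(e_{q,s_{q,i}}, e_{q,s_{q,i}+1}, \ldots, e_{q,t_{q,i}} \mid h_q) \\
&\quad= \sum_{s=1}^{n} \sum_{g=1}^{n} \sum_{q=1}^{n} \sum_{D_{q,i} \in \mathcal{D}(s,g)}
   \log p(e_{q,s}, e_{q,s+1}, \ldots, e_{q,s+g} \mid h_q) \\
&\quad= \sum_{s=1}^{n} \sum_{g=1}^{n} c_{s,g} \frac{1}{c_{s,g}} \sum_{q=1}^{n} \sum_{D_{q,i} \in \mathcal{D}(s,g)}
   \log p(e_{q,s}, e_{q,s+1}, \ldots, e_{q,s+g} \mid h_q) \\
&\quad\le \sum_{s=1}^{n} \sum_{g=1}^{n} c_{s,g} \log\left(\frac{1}{c_{s,g}} \sum_{q=1}^{n}
   \sum_{D_{q,i} \in \mathcal{D}(s,g)} p(e_{q,s}, e_{q,s+1}, \ldots, e_{q,s+g} \mid h_q)\right) \\
&\quad\le \sum_{s=1}^{n} \sum_{g=1}^{n} c_{s,g} \log \frac{2^{m^\ell}}{c_{s,g}},
\end{aligned}$$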
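where the last inequality follows by Lemma 6, the fact that the events in each summation are disjoint, and
the fact that the number of possible histories is at most $2^{m^\ell}$. We now have
$\sum_{s,g} c_{s,g} = |D|$. Thus,

$$\begin{aligned}
\sum_{s=1}^{n} \sum_{g=1}^{n} c_{s,g} \log \frac{2^{m^\ell}}{c_{s,g}}
&= |D| m^\ell + \sum_{s=1}^{n} \sum_{g=1}^{n} c_{s,g} \log \frac{1}{c_{s,g}} \\
&= |D| m^\ell - |D| \log |D| + |D| \sum_{s=1}^{n} \sum_{g=1}^{n} \frac{c_{s,g}}{|D|} \log \frac{|D|}{c_{s,g}} \\
&= |D| m^\ell - |D| \log |D| + |D| H(c_{s,g}),
\end{aligned}$$

where $H(c_{s,g})$ denotes the entropy of the distribution $c_{s,g}/|D|$ over pairs $(s, g)$.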

Let $c_s$ and $c_g$ be the projections of $c_{s,g}$ onto the first and second coordinates. Then, we can write

$$H(c_{s,g}) \le H(c_s) + H(c_g) \le \log n + H(c_g).$$

Using $\sum_{s,g} c_{s,g}\, g \le n^2 m$, by Ziv's inequality, the following holds:
$H(c_g) \le \log\left(\frac{n^2 m}{|D|} + 1\right) + 1$. Combining this with the previous inequalities gives


$$\log p(A) \le |D| \left(m^\ell + \log\left(\frac{n^2 m}{|D|} + 1\right) + 1 - \log \frac{|D|}{n}\right).$$


Taking the expectation of both sides, and then using the concavity of $|D| \mapsto -|D| \log(|D|)$ and
Jensen's inequality, yields

$$H(p) \ge \mathbb{E}[|D|] \left(\log \frac{\mathbb{E}[|D|]}{n} - m^\ell
- \log\left(\frac{n^2 m}{\mathbb{E}[|D|]} + 1\right) - 1\right).$$


D Proof of Theorem 4

We first upper bound $\mathbb{E}[|D|]$ using Lemma 3.

Lemma 7. For the dictionary $D$ generated by LZA,

$$\mathbb{E}[|D|] \le \frac{10\, n^2 m \log(m + 1)\, 2^{m^\ell}}{\log n}.$$


Proof. An automaton is a random variable over transition labels, and counting the possible values of these
labels gives $H(p) \le n^2 m \log(m + 1)$. Combining this inequality with Lemma 3 yields

$$n^2 m \log(m + 1) \ge \mathbb{E}[|D|] \left(\log \frac{\mathbb{E}[|D|]}{n} - m^\ell
- \log\left(\frac{n^2 m}{\mathbb{E}[|D|]} + 1\right) - 1\right).$$

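Now, let $U = \frac{10\, n^2 m \log(m + 1)\, 2^{m^\ell}}{\log n}$ and assume that the inequality
$\mathbb{E}[|D|] > U$ holds. Then, for $n$ sufficiently large, the following inequalities hold:

$$\begin{aligned}
&\mathbb{E}[|D|] \left(\log \frac{\mathbb{E}[|D|]}{n} - m^\ell
  - \log\left(\frac{n^2 m}{\mathbb{E}[|D|]} + 1\right) - 1\right) \\
&\quad> U \left(\log \frac{U}{n} - m^\ell - \log\left(\frac{n^2 m}{U} + 1\right) - 1\right) \\
&\quad= U \left(\log \frac{10\, m n \log(m + 1)\, 2^{m^\ell}}{\log n} - m^\ell
  - \log\left(\frac{\log n}{10 \log(m + 1)\, 2^{m^\ell}} + 1\right) - 1\right) \\
&\quad\ge U \left(\log n - \log \log n
  - \log\left(\frac{\log n}{10 \log(m + 1)\, 2^{m^\ell}} + 1\right) - 1\right) \\
&\quad\ge U \left(\log n - 2 \log \log n - 1\right) \\
&\quad> n^2 m \log(m + 1),
\end{aligned}$$

which leads to a contradiction. This completes the proof of the lemma. □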


We now have all the tools to prove Theorem 4. Let $W(|D|)$ be the upper bound in Lemma 2. Since we
have a probabilistic model and $W$ is concave in $|D|$, the expected number of bits satisfies

$$\mathbb{E}[l_{\mathrm{LZA}}(A_n)] \le W(\mathbb{E}[|D|]).$$

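Substituting the lower bound on $H(p)$ from Lemma 3 and rearranging terms, we have

$$\begin{aligned}
\max_{p(A_n)} \frac{\mathbb{E}[l_{\mathrm{LZA}}(A_n)] - H(p)}{n^2}
&\le \max_{p(A_n)} \frac{W(\mathbb{E}[|D|]) - H(p)}{n^2} \\
&\le \frac{\mathbb{E}[|D|]}{n^2} \log\left(\frac{(n + 1) n}{\mathbb{E}[|D|]}
   \left(\frac{n^2}{\mathbb{E}[|D|]} + 1\right) \left(\frac{n^2 m}{\mathbb{E}[|D|]} + 1\right)\right) \\
&\quad+ \frac{\mathbb{E}[|D|]}{n^2} \left[2 \log(\log(n + 1) + 1)
   + 2 \log\left(\log\left(\frac{n^2}{\mathbb{E}[|D|]} + 1\right) + 1\right) + m^\ell + 4 + \log m\right] \\
&\quad+ \frac{\lceil \log nm \rceil}{n} \\
&= O\left(\frac{2^{m^\ell} \left(\log \log n + m^\ell\right)}{\log n}\right).
\end{aligned}$$

The last equality follows from Lemma 7. As $n \to \infty$, the bound goes to 0 and hence LZA is a universal
compression algorithm.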

